6)

In this exercise, you will further analyze the Wage data set considered throughout this chapter.

a)

Perform polynomial regression to predict wage using age . Use cross-validation to select the optimal degree d for the polynomial. What degree was chosen, and how does this compare to the results of hypothesis testing using ANOVA? Make a plot of the resulting polynomial fit to the data.
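The code that produced the output below is not shown; a minimal sketch of the cross-validation and ANOVA comparison, assuming 10-fold CV via `boot::cv.glm` (the fold count and seed are assumptions):

```r
library(ISLR)   # Wage data
library(boot)   # cv.glm()

set.seed(1)  # assumed seed; the original is not shown
# 10-fold cross-validation error for polynomial degrees 1 through 10
cv_errors <- sapply(1:10, function(d) {
  fit <- glm(wage ~ poly(age, d), data = Wage)
  cv.glm(Wage, fit, K = 10)$delta[1]
})
which.min(cv_errors)  # degree with the lowest CV error

# Hypothesis testing with ANOVA on the nested polynomial fits
fits <- lapply(1:10, function(d) lm(wage ~ poly(age, d), data = Wage))
do.call(anova, fits)
```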

## [1] 9

## # A tibble: 10 x 6
##    Res.Df      RSS    Df `Sum of Sq`        F  `Pr(>F)`
##     <dbl>    <dbl> <dbl>       <dbl>    <dbl>     <dbl>
##  1   1497 2662903.    NA       NA    NA       NA       
##  2   1496 2541213.     1   121690.   71.8      5.72e-17
##  3   1495 2529834.     1    11380.    6.71     9.67e- 3
##  4   1494 2525651.     1     4182.    2.47     1.16e- 1
##  5   1493 2524861.     1      790.    0.466    4.95e- 1
##  6   1492 2523528.     1     1333.    0.786    3.75e- 1
##  7   1491 2523523.     1        4.97  0.00293  9.57e- 1
##  8   1490 2523253.     1      270.    0.159    6.90e- 1
##  9   1489 2522931.     1      322.    0.190    6.63e- 1
## 10   1488 2522916.     1       14.8   0.00870  9.26e- 1

Cross-validation attains its minimum error at degree 9 (the printed value above), but the error curve is essentially flat beyond degree 4, so a degree-4 polynomial is a reasonable choice. This agrees with the ANOVA: the linear and quadratic terms are highly significant, the cubic marginally so, and the p-value comparing the cubic to the quartic model (0.116) and all higher-order comparisons are large, so there is little support for terms beyond degree 3 or 4.
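The fitted polynomial can be plotted along these lines (a ggplot2 sketch; the original plotting code is not shown):

```r
library(ISLR)
library(ggplot2)

# Scatterplot of the data with the degree-4 polynomial fit overlaid
ggplot(Wage, aes(age, wage)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", formula = y ~ poly(x, 4)) +
  labs(title = "Degree-4 polynomial fit of wage on age")
```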

7)

The Wage data set contains a number of other features not explored in this chapter, such as marital status ( maritl ), job class ( jobclass ), and others. Explore the relationships between some of these other predictors and wage , and use non-linear fitting techniques in order to fit flexible models to the data. Create plots of the results obtained, and write a summary of your findings.

Let’s take a look at marital status and job class versus wage:
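The exploratory plots are not reproduced here; since `maritl` and `jobclass` are factors and `wage` is continuous, boxplots are a natural choice. A sketch:

```r
library(ISLR)
library(ggplot2)

# Distribution of wage within each marital-status and job-class category
ggplot(Wage, aes(maritl, wage)) + geom_boxplot()
ggplot(Wage, aes(jobclass, wage)) + geom_boxplot()
```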

Let’s perform an ANOVA:
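An analysis-of-deviance table like the output below can be produced by fitting nested GAMs; using `gam::gam` is an assumption, while the model formulas are taken from the output itself:

```r
library(ISLR)      # Wage data
library(gam)       # gam()
library(splines)   # ns()

fit1 <- gam(wage ~ ns(age, 4), data = Wage)
fit2 <- gam(wage ~ ns(age, 4) + education, data = Wage)
fit3 <- gam(wage ~ ns(age, 4) + education + jobclass, data = Wage)
anova(fit1, fit2, fit3, test = "Chisq")
```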

## Analysis of Deviance Table
## 
## Model 1: wage ~ ns(age, 4)
## Model 2: wage ~ ns(age, 4) + education
## Model 3: wage ~ ns(age, 4) + education + jobclass
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1      2995    4773109                          
## 2      2991    3717338  4  1055770 < 2.2e-16 ***
## 3      2990    3705071  1    12267  0.001653 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Adding education to the spline in age reduces the deviance dramatically (p < 2.2e-16), and adding jobclass gives a further significant, though much smaller, improvement (p ≈ 0.0017).

8)

Fit some of the non-linear models investigated in this chapter to the Auto data set. Is there evidence for non-linear relationships in this data set? Create some informative plots to justify your answer.

In this section we’re going to fit natural splines, smoothing splines and local regression.
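A sketch of the three fits, using mpg against horsepower as an example relationship (the choice of variables and tuning values are assumptions):

```r
library(ISLR)      # Auto data
library(splines)

# Natural spline fit
fit_ns <- lm(mpg ~ ns(horsepower, df = 4), data = Auto)

# Smoothing spline, with the smoothing level chosen by cross-validation
fit_ss <- smooth.spline(Auto$horsepower, Auto$mpg, cv = TRUE)

# Local regression (loess) with a moderate span
fit_lo <- loess(mpg ~ horsepower, data = Auto, span = 0.5)

# Overlay the three fits on a scatterplot
hp <- seq(min(Auto$horsepower), max(Auto$horsepower), length.out = 200)
plot(Auto$horsepower, Auto$mpg, col = "grey")
lines(hp, predict(fit_ns, data.frame(horsepower = hp)), col = "red")
lines(predict(fit_ss, hp), col = "blue")
lines(hp, predict(fit_lo, data.frame(horsepower = hp)), col = "green")
```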

9)

This question uses the variables dis (the weighted mean of distances to five Boston employment centers) and nox (nitrogen oxides concentration in parts per 10 million) from the Boston data. We will treat dis as the predictor and nox as the response.

e)

Now fit a regression spline for a range of degrees of freedom, and plot the resulting fits and report the resulting RSS. Describe the results obtained.
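The tabulated values are on the scale of RSS divided by n (i.e. training MSE) rather than the raw RSS; a sketch that reproduces that layout, assuming B-spline bases via `splines::bs`:

```r
library(MASS)      # Boston data
library(splines)

# Training error of a regression spline in dis for 3 to 15 degrees of freedom
sapply(3:15, function(df) {
  fit <- lm(nox ~ bs(dis, df = df), data = Boston)
  mean(residuals(fit)^2)   # RSS / n, matching the scale of the table
})
```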

## # A tibble: 13 x 2
##       df     rss
##    <int>   <dbl>
##  1     3 0.00382
##  2     4 0.00380
##  3     5 0.00364
##  4     6 0.00362
##  5     7 0.00362
##  6     8 0.00359
##  7     9 0.00361
##  8    10 0.00354
##  9    11 0.00355
## 10    12 0.00354
## 11    13 0.00352
## 12    14 0.00352
## 13    15 0.00352

The RSS decreases as the degrees of freedom increase, with most of the reduction achieved by around 10 degrees of freedom; beyond that the fit barely improves.

10)

This question relates to the College data set.

a)

Split the data into a training set and a test set. Using out-of-state tuition as the response and the other variables as the predictors, perform forward stepwise selection on the training set in order to identify a satisfactory model that uses just a subset of the predictors.
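A sketch of the split and the forward stepwise search using `leaps::regsubsets`; the seed and the half-and-half split are assumptions:

```r
library(ISLR)   # College data
library(leaps)  # regsubsets()

set.seed(1)  # assumed; the original split is not shown
train <- sample(nrow(College), nrow(College) / 2)

fwd <- regsubsets(Outstate ~ ., data = College[train, ],
                  nvmax = 17, method = "forward")
# Plot a selection criterion against model size to find the elbow
plot(summary(fwd)$bic, type = "b",
     xlab = "Number of predictors", ylab = "BIC")
coef(fwd, 6)  # coefficients of the 6-predictor model
```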

The elbow of the curve appears at around 6 predictors, so we will use that value.

##   (Intercept)    PrivateYes    Room.Board      Terminal   perc.alumni 
## -4189.2370968  2741.1022318     0.8183897    43.9431321    45.4200124 
##        Expend     Grad.Rate 
##     0.2493200    32.6191462

b)

Fit a GAM on the training data, using out-of-state tuition as the response and the features selected in the previous step as the predictors. Plot the results, and explain your findings.
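A sketch of the GAM fit, reusing the model formula and the `college_sample$train` name that appear in the summary output of part (d):

```r
library(gam)

# Smoothing splines (df = 3) for the quantitative predictors,
# a linear term for the Private factor
gam_fit <- gam(Outstate ~ Private + s(Room.Board, df = 3) +
                 s(Personal, df = 3) + s(Terminal, df = 3) +
                 s(perc.alumni, df = 3) + s(Expend, df = 3),
               data = college_sample$train)

par(mfrow = c(2, 3))
plot(gam_fit, se = TRUE, col = "blue")
```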


c)

Evaluate the model obtained on the test set, and explain the results obtained.
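The test MSE below can be computed from the GAM fitted in part (b); the names `gam_fit` and `college_sample$test` are assumptions about the original code:

```r
# Predict on the held-out colleges and compute the mean squared error
preds <- predict(gam_fit, newdata = college_sample$test)
mean((college_sample$test$Outstate - preds)^2)  # test MSE
```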

## # A tibble: 1 x 1
##        mse
##      <dbl>
## 1 3610083.

d)

For which variables, if any, is there evidence of a non-linear relationship with the response?

## 
## Call: gam(formula = Outstate ~ Private + s(Room.Board, df = 3) + s(Personal, 
##     df = 3) + s(Terminal, df = 3) + s(perc.alumni, df = 3) + 
##     s(Expend, df = 3), data = college_sample$train)
## Deviance Residuals:
##      Min       1Q   Median       3Q      Max 
## -7690.52 -1335.69   -33.75  1375.82  7070.22 
## 
## (Dispersion Parameter for gaussian family taken to be 3764264)
## 
##     Null Deviance: 6643143400 on 387 degrees of freedom
## Residual Deviance: 1396542566 on 371.0002 degrees of freedom
## AIC: 6994.445 
## 
## Number of Local Scoring Iterations: 2 
## 
## Anova for Parametric Effects
##                         Df     Sum Sq    Mean Sq  F value    Pr(>F)    
## Private                  1 1754371012 1754371012 466.0595 < 2.2e-16 ***
## s(Room.Board, df = 3)    1 1388828290 1388828290 368.9508 < 2.2e-16 ***
## s(Personal, df = 3)      1   33609296   33609296   8.9285  0.002994 ** 
## s(Terminal, df = 3)      1  461667030  461667030 122.6447 < 2.2e-16 ***
## s(perc.alumni, df = 3)   1  282311428  282311428  74.9978 < 2.2e-16 ***
## s(Expend, df = 3)        1  615941976  615941976 163.6288 < 2.2e-16 ***
## Residuals              371 1396542566    3764264                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Anova for Nonparametric Effects
##                        Npar Df  Npar F     Pr(F)    
## (Intercept)                                         
## Private                                             
## s(Room.Board, df = 3)        2  1.2793   0.27946    
## s(Personal, df = 3)          2  1.6708   0.18952    
## s(Terminal, df = 3)          2  2.4039   0.09178 .  
## s(perc.alumni, df = 3)       2  1.9045   0.15035    
## s(Expend, df = 3)            2 26.7278 1.432e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

There is strong evidence of non-linearity for Expend (p ≈ 1.4e-11). Terminal shows at most weak evidence (p ≈ 0.092), and the remaining smooth terms show none.

11)

In Section 7.7, it was mentioned that GAMs are generally fit using a backfitting approach. The idea behind backfitting is actually quite simple. We will now explore backfitting in the context of multiple linear regression.

Suppose that we would like to perform multiple linear regression, but we do not have software to do so. Instead, we only have software to perform simple linear regression.

Therefore, we take the following iterative approach: we repeatedly hold all but one coefficient estimate fixed at its current value, and update only that coefficient estimate using a simple linear regression. The process is continued until convergence—that is, until the coefficient estimates stop changing. We now try this out on a toy example.

a)

Generate a response Y and two predictors X 1 and X 2 , with n = 100.
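A sketch of the data generation; the seed and the true coefficients are arbitrary choices for illustration:

```r
set.seed(42)  # assumed seed
n  <- 100
X1 <- rnorm(n)
X2 <- rnorm(n)
# True model: intercept -2, beta_1 = 3, beta_2 = 0.5, plus Gaussian noise
Y  <- -2 + 3 * X1 + 0.5 * X2 + rnorm(n)
```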

b)

Initialise \(\hat{\beta}_1\) to take on a value of your choice.

c)

Keeping \(\hat{\beta}_1\) fixed, fit the model:

\[ Y - \hat\beta_1 X_1 = \beta_0 + \beta_2 X_2 + \epsilon \]

d)

Keeping \(\hat\beta_2\) fixed, fit the model:

\[ Y - \hat\beta_2 X_2 = \beta_0 + \beta_1 X_1 + \epsilon \]

e)

Write a for loop to repeat (c) and (d) 1000 times, reporting the estimates of \(\hat\beta_0\), \(\hat\beta_1\), and \(\hat\beta_2\) at each iteration.
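A sketch of the backfitting loop, assuming `X1`, `X2`, and `Y` from part (a); the starting value for \(\hat\beta_1\) is arbitrary:

```r
beta_1 <- 10  # initial guess from part (b)
results <- matrix(NA, nrow = 1000, ncol = 3,
                  dimnames = list(NULL, c("beta_0", "beta_1", "beta_2")))

for (i in 1:1000) {
  # (c): with beta_1 fixed, regress the partial residual Y - beta_1 * X1 on X2
  a <- Y - beta_1 * X1
  beta_2 <- coef(lm(a ~ X2))[2]
  # (d): with beta_2 fixed, regress the partial residual Y - beta_2 * X2 on X1
  a <- Y - beta_2 * X2
  fit <- lm(a ~ X1)
  beta_0 <- coef(fit)[1]
  beta_1 <- coef(fit)[2]
  results[i, ] <- c(beta_0, beta_1, beta_2)
}

head(results)  # the estimates stabilise after the first few iterations
```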

## # A tibble: 20 x 4
##    iteration beta_0 beta_1 beta_2
##        <int>  <dbl>  <dbl>  <dbl>
##  1         1  -2.02   2.82  0.508
##  2         2  -2.02   2.82  0.593
##  3         3  -2.02   2.82  0.593
##  4         4  -2.02   2.82  0.593
##  5         5  -2.02   2.82  0.593
##  6         6  -2.02   2.82  0.593
##  7         7  -2.02   2.82  0.593
##  8         8  -2.02   2.82  0.593
##  9         9  -2.02   2.82  0.593
## 10        10  -2.02   2.82  0.593
## 11        11  -2.02   2.82  0.593
## 12        12  -2.02   2.82  0.593
## 13        13  -2.02   2.82  0.593
## 14        14  -2.02   2.82  0.593
## 15        15  -2.02   2.82  0.593
## 16        16  -2.02   2.82  0.593
## 17        17  -2.02   2.82  0.593
## 18        18  -2.02   2.82  0.593
## 19        19  -2.02   2.82  0.593
## 20        20  -2.02   2.82  0.593

f)

Compare the answer to performing a multiple linear regression
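The tidy coefficient table below is consistent with `broom::tidy()` applied to an `lm` fit (an assumption about the original code); assuming `X1`, `X2`, and `Y` from part (a):

```r
library(broom)

# Full multiple linear regression for comparison with the backfitting estimates
tidy(lm(Y ~ X1 + X2))
```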

## # A tibble: 3 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   -2.03      0.105    -19.3  6.57e-35
## 2 X1             2.84      0.195     14.6  3.68e-26
## 3 X2             0.350     0.192      1.82 7.14e- 2

The intercept and \(\hat\beta_1\) estimates from backfitting are very close to the multiple regression values; the \(\hat\beta_2\) estimates differ more noticeably.

g)

Only two iterations were needed for a ‘good’ approximation.

12)

This problem is a continuation of the previous exercise. In a toy example with \(p = 100\), show that one can approximate the multiple linear regression coefficient estimates by repeatedly performing simple linear regression in a backfitting procedure. How many backfitting iterations are required in order to obtain a “good” approximation to the multiple regression coefficient estimates? Create a plot to justify your answer.

We first generate our random predictor matrix \(X\) with \(n = 1000\) and \(p = 100\). We also generate random coefficients, and perform matrix multiplication to calculate \(Y\).
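A sketch of this setup; the seed is an assumption:

```r
set.seed(1)  # assumed seed
n <- 1000
p <- 100

# Random predictor matrix and random true coefficients
X <- matrix(rnorm(n * p), nrow = n, ncol = p)
beta_true <- rnorm(p)

# Response via matrix multiplication, plus Gaussian noise
Y <- X %*% beta_true + rnorm(n)
```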